Abstract

Metric learning makes it plausible to learn distances for complex distributions of data from labeled 
data. However, to date, most metric learning methods are based on a single Mahalanobis metric, which 
cannot handle heterogeneous data well. Those that learn multiple metrics throughout the space have 
demonstrated superior accuracy, but at the cost of computational efficiency. Here, we take a new angle to 
the metric learning problem and learn a single metric that is able to implicitly adapt its distance function 
throughout the feature space. This metric adaptation is accomplished by using a random forest-based 
classifier to underpin the distance function and incorporate both absolute pairwise position and standard 
relative position into the representation. We have implemented and tested our method against state of the 
art global and multi-metric methods on a variety of data sets. Overall, the proposed method outperforms 
both types of methods in terms of accuracy (consistently ranked first) and is an order of magnitude faster 
than state of the art multi-metric methods (16x faster in the worst case). 



1 Introduction 

Although the Euclidean distance is a simple and convenient metric, it is often not an accurate representation
of the underlying shape of the data [Frome et al. 2006]. Such a representation is crucial in many real-world
applications [Boiman et al. 2008; Yang et al. 2011], such as object classification [Fink 2005; Frome et al.
2007], text document retrieval [Lebanon 2006; Wang et al. 2010] and face verification [Chopra et al. 2005;
Nguyen and Bai 2011], and methods that learn a distance metric from training data have hence been widely
studied in recent years. We present a new angle on the metric learning problem based on random forests
[Amit and Geman 1997; Breiman 2001] as the underlying distance representation. The emphasis of our
work is the capability to incorporate the absolute position of point pairs in the input space without requiring
a separate metric per instance or exemplar. In doing so, our method, called random forest distance (RFD), is
able to adapt to the underlying shape of the data by varying the metric based on the position of sample pairs
in the feature space while maintaining the efficiency of a single metric. In some sense, our method achieves
a middle ground between the two main classes of existing methods (single, global distance functions and
multi-metric sets of distance functions), overcoming the limitations of both (see Figure 1 for an illustrative
example). We next elaborate upon these comparisons.

Figure 1: An example using a classic swiss roll data set comparing both global and position-specific
Mahalanobis-based methods with our proposed method, RFD. All methods, including the baseline Euclidean,
perform well at low k-values due to local linearity. However, as k increases and the global
nonlinearity of the data becomes important, the monolithic methods' inability to incorporate position
information causes their performance to degrade until it is little better than chance. The position-specific ISD
method performs somewhat better, but even with a Mahalanobis matrix at every point it is unable to capture
the globally nonlinear relations between points. Our method, by comparison, shows no degradation as k
increases. (3 classes, 900 samples, validated using k-nearest neighbor classification, with varying k.)

The metric learning literature has been dominated by methods that learn a global Mahalanobis metric, with
representative methods including [Hoi et al. 2006; Shi et al. 2011; Weinberger and Saul 2009; Xing et al. 2003].
In brief, given a set of pairwise constraints (either by sampling from labeled data, or collecting side
information in the semi-supervised case), indicating pairs of points that should or should not be grouped
(i.e., have small or large distance, respectively), the goal is to find the appropriate linear transformation
of the data to best satisfy these constraints. One such method [Xing et al. 2003] minimizes the distance
between positively-linked points subject to the constraint that negatively-linked points are separated, but
requires solving a computationally expensive semidefinite programming problem. Relevant Component
Analysis (RCA) [Bar-Hillel et al. 2003] learns a linear Mahalanobis transformation to satisfy a set of positive
constraints. Discriminant Component Analysis (DCA) [Hoi et al. 2006] extends RCA by exploring negative
constraints. ITML [Davis et al. 2007] minimizes the LogDet divergence under positive and negative linear
constraints, and LMNN [Shen et al. 2010; Weinberger and Saul 2009] learns a distance metric through the
maximum margin framework. Nguyen and Guo [2008] formulate metric learning as a quadratic semidefinite
programming problem with local neighborhood constraints and linear time complexity in the original feature
space. More recently, researchers have begun developing fast algorithms that can work in an online manner,
such as POLA [Shalev-Shwartz et al. 2004], MLCL [Globerson and Roweis 2006] and LEGO [Jain et al. 2008].

These global methods learn a single Mahalanobis metric using the relative position of point pairs:
$\mathrm{Dist}(x_i, x_j) = (x_i - x_j)^T W (x_i - x_j)$. Although the resulting single metric is efficient, it is limited in
its capacity to capture the shape of complex data. In contrast, a second class, called multi-metric methods,
distributes distance metrics throughout the input space; in the limit, they estimate a distance metric per
instance or exemplar, e.g., [Frome et al. 2006, 2007] for the case of Mahalanobis metrics. Zhan et al.
[2009] extend [Frome et al. 2006] by propagating metrics learned on training exemplars to learn a matrix for
each unlabeled point as well. However, these point-based multi-metric methods all suffer from high time
and space complexity due to the need to learn and store $O(n)$ metric matrices, each of size $d \times d$. A more
efficient approach to this second class is to divide the data into subsets and learn a metric for each subset
[Babenko et al. 2009; Weinberger and Saul 2008]. However, these methods make strong assumptions in
generating these subsets; for example, [Babenko et al. 2009] learns at most one metric per category, forfeiting
the possibility that different samples within a category may require different metrics.

We propose a metric learning method that is able to achieve both the efficiency of the global methods and
the specificity of the multi-metric methods. Our method, the random forest distance (RFD), transforms the
metric learning problem into a binary classification problem and uses random forests as the underlying
representation [Amit and Geman 1997; Biau and Devroye 2010; Breiman 2001; Leistner et al. 2009].
In this general form, we are able to incorporate the position of samples implicitly into the metric and yet
maintain a single and efficient global metric. To that end, we use a novel point-pair mapping function that
encodes both the position of the points relative to each other and their absolute position within the feature
space. Our experimental analyses demonstrate the importance of incorporating position information into the
metric (Section 3).

We use the random forest as the underlying representation for several reasons. First, the output of the
random forest algorithm is a simple "yes" or "no" vote from each tree in the forest. In our case, "no" votes
correspond to positively constrained training data, and "yes" votes correspond to negatively constrained
training data. The number of yes votes, then, is effectively a distance function, representing the relative
resemblance of a point pair to pairs that are known to be dissimilar versus pairs that are known to be similar.
Second, random forests are efficient and scale well, and have been shown to be one of the most powerful and
scalable supervised methods for handling high-dimensional data [Caruana and Niculescu-Mizil 2006]. In
contrast to instance-specific multi-metric methods [Frome et al. 2006, 2007], the storage requirement of our
method is independent of the size of the input data set. Our experimental results indicate RFD is at least 16
times faster than the state of the art multi-metric method. Third, because random forests are non-parametric,
they make minimal assumptions about the shape and patterning of the data [Breiman 2001], affording a
flexible model that is inherently nonlinear. In the next section, we describe the new RFD method in more
detail, followed by a thorough comparison to the state of the art in Section 3.

2 Random Forest Distance: Implicitly Position-Dependent Metric Learning 



Our random forest-based approach is inspired by several other recent advances in metric learning [Babenko
et al. 2009; Shalev-Shwartz et al. 2004] that reformulate the metric learning problem into a classification
problem. However, where these approaches restricted the form of the learned distance function to a
Mahalanobis matrix, thus precluding the use of position information, we adopt a more general formulation of the
classification problem that removes this restriction.

Given the instance set $X = \{x_1, x_2, \ldots, x_N\}$, each $x_i \in \mathbb{R}^m$ is a vector of $m$ features. Taking a geometric
interpretation of each $x_i$, we consider $x_i$ the position of sample $i$ in the space $\mathbb{R}^m$. The value of this
interpretation will become clear throughout the paper as the learned metric will implicitly vary over $\mathbb{R}^m$,
which allows it to adapt based on local structure in a manner similar to the instance-specific
multi-metric methods, e.g., [Frome et al. 2006]. Denote two pairwise constraint sets: a must-link
constraint set $S = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ are similar}\}$ and a do-not-link constraint set
$D = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ are dissimilar}\}$. For any constraint $(x_i, x_j)$, denote $y_{ij}$ as the ideal distance
between $x_i$ and $x_j$. If $(x_i, x_j) \in S$, then the distance $y_{ij} = 0$, otherwise $y_{ij} = 1$. Therefore, we seek a
function $\mathrm{Dist}(\cdot, \cdot)$ from an appropriate function space $H$:

$$\mathrm{Dist}(\cdot, \cdot)^* = \arg\min_{\mathrm{Dist}(\cdot, \cdot) \in H} \frac{1}{|S \cup D|} \sum_{(x_i, x_j) \in S \cup D} l\big(\mathrm{Dist}(x_i, x_j), y_{ij}\big), \qquad (1)$$

where $l(\cdot)$ is some loss function that will be specified by the specific classifier chosen. In our random forests
case, we minimize expected loss, as in many classification problems. So consider $\mathrm{Dist}(\cdot, \cdot)$ to be a binary
classifier for the classes 0 and 1. For flexibility, we redefine the problem as $\mathrm{Dist}(x_i, x_j) = F(\phi(x_i, x_j))$,
where $F(\cdot)$ is some classification model, and $\phi(x_i, x_j)$ is a mapping function that maps the pair $(x_i, x_j)$ to a
feature vector that will serve as input for the classifier function $F$. To train $F$, we transform each constraint
pair using the mapping function $\{(x_i, x_j), y_{ij}\} \rightarrow \{\phi(x_i, x_j), y_{ij}\}$ and submit the resulting set of vectors
and labels as training data. We next describe the feature mapping function $\phi$.
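The training-set construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `sample_constraints` and `build_training_set`, the toy points, and the stand-in mapping `placeholder_phi` are all introduced here for illustration, and any pair mapping can be substituted for the stand-in.

```python
import random


def sample_constraints(labels, n_per_set, rng):
    """Sample must-link (S) and do-not-link (D) index pairs uniformly at random.

    Pairs whose labels match go to S; all other pairs go to D.
    """
    n = len(labels)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    rng.shuffle(pairs)  # uniform random order over all unordered pairs
    similar, dissimilar = [], []
    for i, j in pairs:
        target = similar if labels[i] == labels[j] else dissimilar
        if len(target) < n_per_set:
            target.append((i, j))
    return similar, dissimilar


def build_training_set(points, similar, dissimilar, phi):
    """Map each constraint pair through phi; label S pairs 0 and D pairs 1."""
    features = [phi(points[i], points[j]) for i, j in similar + dissimilar]
    targets = [0] * len(similar) + [1] * len(dissimilar)
    return features, targets


def placeholder_phi(xi, xj):
    """Stand-in pair mapping (assumption): element-wise absolute difference."""
    return [abs(a - b) for a, b in zip(xi, xj)]


# Toy data (hypothetical): four 2-D points in two classes.
points = [[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.2, 2.9]]
labels = ["a", "a", "b", "b"]

S, D = sample_constraints(labels, n_per_set=2, rng=random.Random(0))
features, targets = build_training_set(points, S, D, placeholder_phi)
```

The resulting `features` and `targets` lists are what would be handed to a binary classifier; the ideal distance 0 or 1 serves directly as the class label.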

2.1 Mapping function for implicitly position-dependent metric learning 

Table 1: UCI data sets used for KNN-classification testing

| Dataset | Size | Dim. | No. Classes |
|---|---|---|---|
| Balance | 625 | 4 | 3 |
| Iris | 150 | 4 | 3 |
| BUPA Liver Disorders | 345 | 6 | 2 |
| Pima Indians Diabetes | 768 | 8 | 2 |
| Breast Cancer | 699 | 10 | 2 |
| Wine | 178 | 13 | 3 |
| Image Segmentation | 2310 | 19 | 7 |
| Sonar | 208 | 60 | 2 |
| Semeion Handwritten Digits | 1593 | 256 | 10 |
| Multiple Features Handwritten Digits | 2000 | 649 | 10 |

In actuality, all metric learning methods implicitly employ a mapping function $\phi(x_i, x_j)$. However,
Mahalanobis-based methods are restricted in terms of what features their metric solution can encode. These
methods all learn a (positive semidefinite) metric matrix $W$, and a distance function of the form
$\mathrm{Dist}(x_i, x_j) = (x_i - x_j)^T W (x_i - x_j)$, which can be reformulated as
$\mathrm{Dist}(x_i, x_j) = [W]^T [(x_i - x_j)(x_i - x_j)^T]$, where
$[\cdot]$ denotes vectorization or flattening of a matrix. Mahalanobis-based methods can thus be viewed as using
the mapping function $\phi(x_i, x_j) = [(x_i - x_j)(x_i - x_j)^T]$. This function encodes only relative position
information, and the Mahalanobis formulation allows the use of no other features.
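The reformulation of the Mahalanobis distance as a linear function of the flattened outer product can be checked numerically. The sketch below uses plain Python lists; the function names and the values of `W`, `xi` and `xj` are illustrative assumptions, not taken from the paper.

```python
def mahalanobis_sq(xi, xj, W):
    """Quadratic form (xi - xj)^T W (xi - xj), with W given as a list of rows."""
    d = [a - b for a, b in zip(xi, xj)]
    m = len(d)
    return sum(d[r] * W[r][c] * d[c] for r in range(m) for c in range(m))


def outer_difference_features(xi, xj):
    """Flattened outer product [(xi - xj)(xi - xj)^T] in row-major order."""
    d = [a - b for a, b in zip(xi, xj)]
    return [dr * dc for dr in d for dc in d]


def flatten(W):
    """Row-major vectorization [W] of a square matrix."""
    return [w for row in W for w in row]


# Illustrative values only (not taken from the paper).
W = [[2.0, 0.5], [0.5, 1.0]]
xi, xj = [1.0, 3.0], [4.0, 1.0]

quadratic = mahalanobis_sq(xi, xj, W)
linear = sum(w * f for w, f in zip(flatten(W), outer_difference_features(xi, xj)))
```

Both expressions evaluate the same sum of terms, so `quadratic` and `linear` agree; the only features visible to the learned weights are products of coordinate differences.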

However, our formulation affords a more general mapping function:

$$\phi(x_i, x_j) = \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} |x_i - x_j| \\ \tfrac{1}{2}(x_i + x_j) \end{bmatrix}, \qquad (2)$$

which considers both the relative location of the samples $u$ as well as their absolute position $v$. The output
feature vector is the concatenation of these two and is in $\mathbb{R}^{2m}$.

The relative location $u$ represents the same information as the Mahalanobis mapping function. Note that we
take the absolute value in $u$ to enforce symmetry in the learned metric. The primary difference between our
mapping function and that of previous methods is thus the information contained in $v$, the mean of the two
point vectors. It localizes each mapped pair to a region of the space, which allows our method to adapt to
heterogeneous distributions of data. It is for this reason that we consider our learned metric to be implicitly
position-dependent. Note that the earlier methods that learn position-based metrics, i.e., the methods that learn a
metric per instance such as [Frome et al. 2006], incorporate the absolute position of each instance only, whereas
we incorporate the absolute position of each instance pair, which adds additional modeling versatility.
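The pair mapping described above (element-wise absolute difference for the relative part, midpoint for the absolute part) can be sketched in a few lines. The function name `pair_features` and the example points are assumptions made for illustration.

```python
def pair_features(xi, xj):
    """Map a point pair to [u, v]: u = |xi - xj| (relative), v = (xi + xj) / 2 (absolute)."""
    u = [abs(a - b) for a, b in zip(xi, xj)]
    v = [(a + b) / 2.0 for a, b in zip(xi, xj)]
    return u + v  # concatenation, length 2m


# Two pairs with the same offset but located in different parts of the space.
near_origin = pair_features([0.0, 0.0], [1.0, 2.0])
far_away = pair_features([10.0, 10.0], [11.0, 12.0])
```

The two pairs share the same relative part but differ in the absolute part, which is exactly the information a Mahalanobis-style mapping cannot see. Swapping the arguments leaves the output unchanged, so any function of these features is symmetric.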

We note that alternate encodings of the position information are possible but have shortcomings. For
example, we could choose to simply concatenate the position of the two points rather than average them, but this
approach raises the issue of ordering the points. Using $v = [x_i \; x_j]$ would again yield a nonsymmetric
feature, and an arbitrary ordering rule would not guarantee meaningful feature comparisons. The usefulness
of position information varies depending on the data set. For data that is largely linear and homogeneous,
including $v$ will only add noise to the features, and could worsen the accuracy. In our experiments, we found
that for many real data sets (and particularly for more difficult data sets) the inclusion of $v$ significantly
improves the performance of the metric (see Section 3).

2.2 Random forests for metric learning 

Random forests are well studied in the machine learning literature and we do not describe them in any detail;
the interested reader is directed to [Amit and Geman 1997; Breiman 2001]. In brief, a random forest is a
set of decision trees $\{f_t\}_{t=1}^{T}$ operating on a common feature space, in our case $\mathbb{R}^{2m}$. To evaluate a
point-pair $(x_i, x_j)$, each tree independently classifies the sample (based on the leaf node at which the point-pair
arrives) as similar or dissimilar (0 or 1, respectively) and the forest averages them, essentially regressing a
distance measure on the point-pair:



$$\mathrm{Dist}(x_i, x_j) = F(\phi(x_i, x_j)) = \frac{1}{T} \sum_{t=1}^{T} f_t(\phi(x_i, x_j)), \qquad (3)$$

where $f_t(\cdot)$ is the classification output of tree $t$.
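The averaging of tree votes can be illustrated with a toy forest. The sketch below does not train anything: the stumps are hand-written stand-ins for learned trees, and the names `forest_distance` and `make_stump` are assumptions introduced for this example.

```python
def forest_distance(trees, features):
    """Fraction of trees voting 'dissimilar' (1) for a mapped point pair."""
    votes = [tree(features) for tree in trees]
    return sum(votes) / len(votes)


def make_stump(index, threshold):
    """Hypothetical one-split tree: vote 1 if features[index] exceeds threshold."""
    def stump(features):
        return 1 if features[index] > threshold else 0
    return stump


# Four hand-written stumps on the first feature (illustrative, not learned).
trees = [make_stump(0, t) for t in (0.5, 1.0, 1.5, 2.0)]

close_pair = forest_distance(trees, [0.2, 0.0])   # no tree votes dissimilar
middle_pair = forest_distance(trees, [1.2, 0.0])  # two of four trees vote dissimilar
far_pair = forest_distance(trees, [3.0, 0.0])     # every tree votes dissimilar
```

Because each vote is 0 or 1, the returned value always lies in [0, 1], and with T trees it can take at most T + 1 distinct values.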

It has been found empirically that random forests scale well with increasing dimensionality, compared with
other classification methods [Caruana and Niculescu-Mizil 2006], and, as a decision tree-based method,
they are inherently nonlinear. Hence, our use of them in RFD as a regression algorithm allows for a more
scalable and more flexible metric than is possible using Mahalanobis methods. Moreover, the incorporation
of position information into this classification function (as described in Section 2.1) allows the metric to
implicitly adapt to different regions over the feature space. In other words, when a decision tree in the
random forest selects a node split based on a value of the absolute position $v$ sub-vector (see Eq. 2), then
all evaluation in the sub-tree is localized to a specific half-space of $\mathbb{R}^m$. Subsequent splits on elements of $v$
further refine the sub-space of emphasis within $\mathbb{R}^m$. Indeed, each path through a decision tree in the random forest
is localized to a particular (possibly overlapping) sub-space.
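The localizing effect of a split on the absolute position can be seen in a one-dimensional toy case. The tree below is written by hand purely for illustration (it is not a learned tree, and its thresholds are assumptions): its root splits on the midpoint of the pair, and each branch then applies its own threshold to the separation.

```python
def pair_features(xi, xj):
    """One-dimensional pair mapping: [u, v] = [|xi - xj|, (xi + xj) / 2]."""
    return [abs(xi - xj), (xi + xj) / 2.0]


def position_aware_tree(features):
    """Hypothetical tree whose root splits on v, then on u within each half-space."""
    u, v = features
    if v < 0.0:
        # Left half-space: points this far apart already count as dissimilar.
        return 1 if u > 0.5 else 0
    # Right half-space: a much larger separation is tolerated.
    return 1 if u > 2.0 else 0


left_vote = position_aware_tree(pair_features(-2.0, -1.0))  # u = 1.0, v = -1.5
right_vote = position_aware_tree(pair_features(2.0, 3.0))   # u = 1.0, v = 2.5
```

Both pairs are separated by the same amount, yet the tree votes dissimilar on the left and similar on the right, which is the behaviour a single global Mahalanobis matrix cannot reproduce.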

The RFD is not technically a metric but rather a pseudosemimetric. Although RFD can easily be shown to be
non-negative and symmetric, it does not satisfy the triangle inequality (i.e.,
$\mathrm{Dist}(x_1, x_2) \le \mathrm{Dist}(x_1, x_3) + \mathrm{Dist}(x_2, x_3)$) or the implication that
$\mathrm{Dist}(x_1, x_2) = 0 \Leftrightarrow x_1 = x_2$, sometimes called identity of
indiscernibles. It is straightforward to construct examples for both of these cases. Although this point
may appear problematic, it is not uncommon in the metric learning literature. For example, by necessity,
no metric whose distance function varies across the feature space can guarantee the triangle inequality is
satisfied. [Frome et al. 2006, 2007] similarly cannot satisfy the triangle inequality. Our method must
violate the triangle inequality in order to fulfill our original objective of producing a metric that incorporates
position data. Moreover, our extensive experimental results demonstrate the capability of RFD as a distance
(Section 3).
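One way to construct such examples is with a forest consisting of a single hand-written stump, as in the sketch below. The stump, its threshold, and the three points are hypothetical choices made for this illustration; they only show that violations are possible, not how often they occur in a learned forest.

```python
def pair_features(xi, xj):
    """One-dimensional pair mapping: [u, v] = [|xi - xj|, (xi + xj) / 2]."""
    return [abs(xi - xj), (xi + xj) / 2.0]


def single_tree_forest(features):
    """Hypothetical forest of one stump: dissimilar (1) only if u > 1."""
    return 1.0 if features[0] > 1.0 else 0.0


def dist(xi, xj):
    return single_tree_forest(pair_features(xi, xj))


x1, x2, x3 = 0.0, 2.0, 1.0

# Triangle inequality: Dist(x1, x2) = 1, but Dist(x1, x3) + Dist(x2, x3) = 0.
triangle_violated = dist(x1, x2) > dist(x1, x3) + dist(x2, x3)

# Identity of indiscernibles: distinct points at distance zero.
zero_for_distinct = dist(x1, x3) == 0.0 and x1 != x3
```

Non-negativity and symmetry still hold here, since the vote is 0 or 1 and the pair mapping is unchanged when its arguments are swapped.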

3 Experiments and Analysis 

In this section, we present a set of experiments comparing our method to state of the art metric learning
techniques on both a range of UCI data sets (Table 1) and an image data set taken from the Corel database.
To substantiate our claim of computational efficiency, we also provide an analysis of running time efficiency
relative to an existing position-dependent metric learning method.

For the UCI data sets, we compare performance at the k-nearest neighbor classification task against both
standard Mahalanobis methods and point-based position-dependent methods. For the former, we test k-NN
classification accuracy at a range of k-values (as in Figure 1), while the latter relies on results published
by other methods' authors, and thus uses a fixed k. For the image data set, we measure accuracy at k-NN
retrieval, rather than k-NN classification. We compare our results to several Mahalanobis methods.

The primary experimental findings covered in the following sections are:

1. RFD has the best overall performance on ten UCI data sets ranging from 4 to 649 dimensions against
four state of the art and two baseline global Mahalanobis-based methods (Figure 2 and Table 2).

2. RFD has comparable or superior accuracy to state of the art position-specific methods (Table 3).

3. RFD is 16 to 85 times faster than the state of the art position-specific method (Table 4).

4. RFD outperforms the state of the art in nine out of ten categories in the benchmark Corel image
retrieval problem (Figure 4).



3.1 Comparison with global Mahalanobis metric learning methods 

We first compare our method to a set of state of the art Mahalanobis metric learning methods: RCA
[Bar-Hillel et al. 2003], DCA [Hoi et al. 2006], Information-Theoretic Metric Learning (ITML) [Davis et al.
2007] and distance metric learning for large-margin nearest neighbor classification (LMNN) [Shen et al.
2010; Weinberger and Saul 2009]. For our method, we test using the full feature mapping including relative
position data, u, and absolute pairwise position data, v, (RFD (+P)) as well as with only relative position
data, u, (RFD (-P)). To provide a baseline, we also show results using both the Euclidean distance and a
heuristic Mahalanobis metric, where the W used is simply the covariance matrix for the data. All algorithm
code was obtained from authors' websites, for which we are indebted (our code is available at
http://www.cse.buffalo.edu/~jcorso).

[Figure 2 plots omitted. Ten panels, one per data set: Multiple Features Handwritten Digits, Semeion, Sonar,
Segmentation, Wine, Breast, Diabetes, BUPA, Iris, Balance. Horizontal axis: k; vertical axis: accuracy.
Legend: Euclidean, Mahalanobis, RCA, DCA, ITML, LMNN, RFD(-P), RFD(+P).]

Figure 2: k-nearest neighbor classification results with varying k values of RFD versus assorted global
Mahalanobis methods on 10 UCI data sets. Plots show k-nearest neighbor k-value versus accuracy. Note in
particular the segmentation and breast datasets, where RFD shows little or no degradation over increasing
distances, while other methods steadily decline in accuracy. Also note that the inclusion of position
information in the RFD yields higher performance on all but the low-dimensional and highly linear iris dataset.

We test each algorithm on a number of standard small to medium scale UCI data sets (see Table 1). All
algorithms are trained using 1000 positive and 1000 negative constraints per class, with the exceptions of
RCA, which used only the 1000 positive constraints, and LMNN, which used the full label set to actively
select a (generally much larger) set of constraints; constraints are all selected randomly according to a
uniform distribution. In each case, we set the number of trees used by our method to 400 (see Section 3.2
for a discussion of the effect of varying forest sizes).

Testing is performed using 5-fold cross validation on the k-nearest neighbor classification task. Rather than
selecting a single k-value for this task, we test with varying values of k, increasing in increments of 5 up to the
maximum possible value for each data set (i.e., the number of elements in the smallest class). By varying
k in this way, we are able to gain some insight into each method's ability to capture the global variation
in a data set. When k is small, most of the identified neighbors lie within a small local region surrounding
the query point, enabling linear metrics to perform fairly well even on globally nonlinear data by taking
advantage of local linearity. However, as k increases, local linearity becomes less practical, and the quality
of the metric's representation of the global structure of the data is exposed. Though the accuracy results
at higher k values do not have strong implications for each method's efficacy for the specific task of
k-NN classification (where an ideal k value can just be selected by cross-validation), they do indicate overall
metric performance, and are highly relevant to other tasks, such as retrieval.
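The evaluation loop over k can be sketched with a k-NN classifier that accepts an arbitrary distance function. This is a simplified illustration, not the experimental code: the Euclidean distance stands in for a learned distance, the toy data are hypothetical, and cross validation is omitted for brevity.

```python
from collections import Counter
import math


def euclidean(p, q):
    """Baseline distance; any learned distance with the same signature can replace it."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(query, train_points, train_labels, k, dist):
    """Majority label among the k training points closest to query under dist."""
    order = sorted(range(len(train_points)), key=lambda i: dist(query, train_points[i]))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(test_points, test_labels, train_points, train_labels, k, dist):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(q, train_points, train_labels, k, dist) == y
               for q, y in zip(test_points, test_labels))
    return hits / len(test_points)


# Toy 1-D data (hypothetical): two well-separated classes.
train_points = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
train_labels = ["a", "a", "a", "b", "b", "b"]
test_points, test_labels = [[0.05], [1.15]], ["a", "b"]

accuracy_by_k = {
    k: knn_accuracy(test_points, test_labels, train_points, train_labels, k, euclidean)
    for k in (1, 3, 5)
}
```

Passing the distance as a parameter keeps the classifier independent of how the distance was obtained, so the same loop can compare several metrics at each k.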

Figure 2 shows the accuracy plots for ten UCI datasets. RFD is consistently near the top performers on these
various data sets. In the lower dimension case (Iris), most methods perform well, and RFD without position
information outperforms RFD with position information (this is the sole data set in which this occurs), which
we attribute to the limited data set size (150 samples) and the position information acting as a distractor in
this small and highly linear case. In all other cases, the RFD with absolute position information significantly
outperforms RFD without it. In many of the more difficult cases (Diabetes, Segmentation, Sonar),
RFD with position information significantly outperforms the field. This result suggests that RFD can
scale well with increasing dimensionality, which is consistent with the findings from the literature that
random forests are one of the most robust classification methods for high-dimensional data [Caruana and
Niculescu-Mizil 2006].



Table 2 provides a summary statistic of the methods by computing the mean rank (lower is better) over the ten
data sets at varying k-values. For all but one value of k, RFD with absolute position information has the
best mean rank of all the methods (and for the off-case, it is ranked a close second). RFD without absolute
position information performs comparatively poorer, underscoring the utility of the absolute position
information. In summary, the results in Table 2 show that RFD is consistently able to outperform the state of the
art in global metric learning methods on various benchmark problems.
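The mean-rank statistic can be computed as in the sketch below. The accuracies are hypothetical and the tie-handling rule (tied methods share the better rank) is an assumption, since the text does not state how ties were resolved.

```python
def competition_ranks(scores):
    """Rank methods by accuracy (1 = best); tied methods share the better rank."""
    return {name: 1 + sum(other > value for other in scores.values())
            for name, value in scores.items()}


def mean_ranks(per_dataset_scores):
    """Average each method's rank over all data sets."""
    totals = {}
    for scores in per_dataset_scores:
        for name, rank in competition_ranks(scores).items():
            totals[name] = totals.get(name, 0) + rank
    return {name: total / len(per_dataset_scores) for name, total in totals.items()}


# Hypothetical accuracies for three methods on two data sets (not from the paper).
accuracies = [
    {"Euclidean": 0.70, "LMNN": 0.80, "RFD": 0.85},
    {"Euclidean": 0.60, "LMNN": 0.75, "RFD": 0.72},
]
summary = mean_ranks(accuracies)
```

A mean rank summarizes relative standing only; it discards the size of the accuracy gaps, which is why the per-data-set plots remain useful alongside it.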

3.2 Varying forest size 

One question that must be addressed when using RFD is how many trees must or should be learned in order 
to obtain good results. Increasing the size of the forest increases computation and space requirements, and 
past a certain point yields little or no improvement and may possibly over-train. It is beyond the scope 
of this paper to provide a full answer as to how many trees are needed in RFD, but we have made some 
observations. 
[Figure 3 plot omitted. Horizontal axis: number of nearest neighbors.]

Figure 3: Effect of forest size on RFD performance on the UCI diabetes data set. Results were obtained by
averaging results from 10 runs, each using 5-fold cross validation. Both with and without position
information, increasing forest size yields notable improvements in accuracy up to about 100 trees. If no position
information is included, then additional trees beyond this point provide modest gains at best. With position
information, larger forests do appear to allow more fine-tuning, and can produce noticeable improvements
up to at least 500 trees.



First, the addition of absolute position information noticeably increases the benefit that may be obtained
from additional trees (see Figure 3). This result is unsurprising, considering the increased size of the feature
vector, as well as the increased degree of fine-tuning possible for a metric that can vary from region to
region. Second, in our experiments we observe significant improvements in accuracy up to about 100 trees,
even without position information, and would recommend this as a reasonable minimum value. It seems
reasonable that larger constraint sets will require larger forests, and similarly, the more complex the shape
of the data, the larger the forest may need to be. However, these two points have not yet been thoroughly explored
by our group.

3.3 Comparison with position-specific multi-metric methods 

We compare our method to three multi-metric methods that incorporate absolute position (via
instance-specific metrics): FSM, FSSM and ISD. FSM [Frome et al. 2006] learns an instance-specific distance for
each labeled example. FSSM [Frome et al. 2007] is an extension of FSM that enforces global consistency
and comparability among the different instance-specific metrics. ISD [Zhan et al. 2009] first learns
instance-specific distance metrics for each labeled data point, then uses metric propagation to generate
instance-specific metrics for unlabeled points as well.

We again use the ten UCI data sets, but under the same conditions used by these methods' authors. Accuracy
is measured on the k-NN task (k = 11) with three-fold cross validation. The parameters of the compared
methods are set as suggested in [Zhan et al. 2009]. Our RFD method chooses 1% of the available positive
constraints and 1% of the available negative constraints, and constructs a random forest with 1000 trees.
We report the average result of ten different runs on each data set, with random partitions of training/testing
data generated each time (see Table 3). These results show that our RFD method yields performance better
than or comparable to state of the art explicitly multi-metric learning methods. Additionally, because we
only learn one distance function and random forests are an inherently efficient technique, our method offers
significantly better computational efficiency than these instance-specific approaches (see Table 4): between
16 and 85 times faster than ISD.

Table 2: Mean k-nearest neighbor classification accuracy ranking on 10 UCI data sets at varying k values
(lower rank is better). The mean ranking is shown in each table cell as well as its rank, in parentheses; i.e.,
for k of 5, RFD (+P) has a mean rank of 2.9, the number 1 mean rank. As expected, Euclidean has
the worst rank at every k except 30. RFD with absolute position information attains the best rank in nearly
all cases, and the relative performance of both RFD methods improves as k increases.

| k-value | Euclid | Mahal | RCA | DCA | ITML | LMNN | RFD (-P) | RFD (+P) |
|---|---|---|---|---|---|---|---|---|
| 5 | 5.8 (8) | 5.7 (7) | 4.3 (4) | 4.8 (5) | 3.9 (3) | 3.2 (2) | 5.4 (6) | 2.9 (1) |
| 10 | 6.1 (8) | 5.6 (7) | 3.7 (3) | 4.6 (4) | 4.8 (5) | 2.9 (1) | 5.1 (6) | 3.2 (2) |
| 15 | 5.7 (8) | 5.4 (6) | 3.9 (3) | 4.7 (5) | 5.6 (7) | 3.1 (2) | 4.6 (4) | 3 (1) |
| 20 | 5.6 (8) | 5.4 (7) | 3.8 (3) | 5.2 (5) | 5.3 (6) | 3.7 (2) | 4.5 (4) | 2.5 (1) |
| 25 | 6.1 (8) | 5.3 (6) | 4 (3) | 4.5 (4) | 5.4 (7) | 3.4 (2) | 4.8 (5) | 2.5 (1) |
| 30 | 5.8 (7) | 5.9 (8) | 4.5 (5) | 4.3 (3) | 5.3 (6) | 3.5 (2) | 4.3 (3) | 2.4 (1) |
| 35 | 5.8 (8) | 5.4 (6) | 4.3 (4) | 4.9 (5) | 5.5 (7) | 4 (3) | 3.8 (2) | 2.3 (1) |
| 45 | 6.6 (8) | 5.5 (6) | 4.4 (4) | 4.4 (4) | 5.9 (7) | 3.3 (2) | 4.1 (3) | 1.8 (1) |
| Max | 6.5 (8) | 6.1 (7) | 5.1 (5) | 3.7 (3) | 5.5 (6) | 3.7 (3) | 3.5 (2) | 1.9 (1) |

Table 3: Comparison of test error (mean ± STD) for position-dependent metric learning methods. The best
performance on each data set is shown in bold. We note that our method yields the best accuracy on 3 out
of 5 data sets tested, and is within 1% of the best on the remaining 2.

| Dataset | RFD | ISDL1 | ISDL2 | FSM | FSSM |
|---|---|---|---|---|---|
| Balance | 0.120±0.024 | **0.114±0.013** | 0.116±0.014 | 0.134±0.020 | 0.143±0.013 |
| Diabetes | **0.241±0.028** | 0.287±0.019 | 0.269±0.023 | 0.342±0.050 | 0.322±0.232 |
| Breast (Scaled) | **0.030±0.011** | 0.031±0.010 | **0.030±0.010** | 0.102±0.041 | 0.112±0.029 |
| German | 0.277±0.039 | 0.277±0.015 | **0.274±0.013** | 0.275±0.021 | 0.275±0.060 |
| Haberman | **0.273±0.029** | 0.277±0.029 | **0.273±0.025** | 0.276±0.032 | 0.276±0.029 |

Table 4: Run-time comparison of ISD and RFD (with position information, using 1000 trees) across several
UCI data sets. All times are in seconds. Results were obtained by performing 5-fold cross validation and
averaging the time for each fold. *Note that ISD is multithreaded across 12 cores, while our implementation
of RFD is fully sequential.

| Dataset | ISD Time* | RFD Time | ISD:RFD Ratio |
|---|---|---|---|
| Iris | 34.6 | 2.1 | 16.4 |
| Balance | 620.3 | 11.2 | 55.3 |
| Breast (scaled) | 657.4 | 7.8 | 84.6 |
| Diabetes | 849.5 | 14.7 | 57.8 |

[Figure 4 bar chart omitted. Categories: Butterflies, Cats, Dogs, Mountains, Penguins, Roses, Sunsets, Owl,
Cougar, Fish. Legend: RFD, DCA, ITML, Euclidean.]

Figure 4: Average retrieval precision on top 20 nearest neighbors of images in the Corel data set. RFD
outperforms DCA, ITML and the baseline Euclidean measure on all but one category.

The comparable level of accuracy is not surprising. While our method is a single metric in form, in practice 
its implicit position-dependence allows it to act like a multi-metric system. Notably, because our method 
learns using the position of each point pair rather than each point, it can potentially encode up to n^2 implicit 
position-specific metrics, rather than the O(n) learned by existing position-dependent methods, which learn 
a single metric per instance/position. RFD is a stronger way to learn a position-dependent metric, because 
even explicit multi-metric methods will fail over global distances in cases where a single (Mahalanobis) 
metric cannot capture the relationship between its associated point and every other point in the data. 
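
To make the point-pair view concrete, the following is a minimal sketch of how labeled points could be turned into pair examples that carry both relative and absolute position. The concatenation of |u - v| and (u + v) / 2 is an assumed form of the pair mapping (this section does not restate it), and the names pair_features and build_pair_dataset are hypothetical. A classifier such as a random forest would then be trained on the resulting pairs; that step is omitted here.

```python
from itertools import combinations


def pair_features(u, v):
    """Map a point pair to relative and absolute position features.

    Assumed form: [|u - v|, (u + v) / 2]. Both parts are symmetric in
    (u, v), so the ordering of the pair does not matter.
    """
    relative = [abs(a - b) for a, b in zip(u, v)]
    absolute = [(a + b) / 2.0 for a, b in zip(u, v)]
    return relative + absolute


def build_pair_dataset(points, labels):
    """Enumerate all unordered pairs; target 0 = same class, 1 = different."""
    features, targets = [], []
    for i, j in combinations(range(len(points)), 2):
        features.append(pair_features(points[i], points[j]))
        targets.append(0 if labels[i] == labels[j] else 1)
    return features, targets


points = [[0.0, 0.0], [1.0, 0.0], [4.0, 4.0]]
labels = ["a", "a", "b"]
X, y = build_pair_dataset(points, labels)
```

With n points there are n(n - 1)/2 unordered pairs, each with its own absolute position, which is where the n^2 count above comes from. Enumerating every pair is only for illustration; the experiments in this paper sample a subset of constraints.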

3.4 Retrieval on the Corel image data set 

We also evaluate our method's performance on a challenging image retrieval task, because this task differs 
from k-NN classification by emphasizing the accuracy of individual pairwise distances rather than broad 
patterns. For this task, we use an image data set taken from the Corel image database. We select ten image 
categories of varying types (cats, roses, mountains, etc.), each with a clear semantic meaning; the classes 
and images are similar to those used by Hoi et al. to validate DCA [Hoi et al. 2006]. Each class contains 
100 images, for a total of 1000 images in the data set. 

For each image, we extract a 36-dimensional low-level feature vector comprising color, shape and texture. 
For color, we extract mean, variance and skewness in each HSV color channel, and thus obtain 9 color 
features. For shape, we employ a Canny edge detector and construct an 18-dimensional edge direction 
histogram for the image. For texture, we apply Discrete Wavelet Transformation (DWT) to gray-level versions 
of original RGB images. A Daubechies-4 wavelet filter is applied to perform 3-level decomposition, and 
mean, variance and mode of each of the 3 levels are extracted as a 9-dimensional texture feature. 
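
As an illustration of the color part of this feature vector, the sketch below computes mean, variance and skewness for each HSV channel, giving 9 values. It is not the authors' extraction code: the function name color_moments_hsv is hypothetical, pixels are assumed to be RGB triples scaled to [0, 1], the population (biased) variance is assumed, and hue is treated as an ordinary number even though it is a circular quantity.

```python
import colorsys
import math


def color_moments_hsv(pixels):
    """Return [mean, variance, skewness] for H, S and V (9 features).

    pixels: iterable of (r, g, b) tuples with components in [0, 1].
    """
    hsv = [colorsys.rgb_to_hsv(r, g, b) for r, g, b in pixels]
    features = []
    for channel in zip(*hsv):  # iterate over the H, S and V value lists
        n = len(channel)
        mean = sum(channel) / n
        var = sum((x - mean) ** 2 for x in channel) / n
        std = math.sqrt(var)
        # Skewness is undefined for a constant channel; report 0.0 there.
        if std > 0:
            skew = sum((x - mean) ** 3 for x in channel) / n / std ** 3
        else:
            skew = 0.0
        features.extend([mean, var, skew])
    return features


# Four-pixel toy image: pure red, green, blue and white.
toy_pixels = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
color_features = color_moments_hsv(toy_pixels)
```

The edge direction histogram and wavelet texture features would be appended to these 9 values to form the full 36-dimensional vector.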

We compare three state of the art algorithms and a Euclidean distance baseline: ITML, DCA, and our RFD 
method (with absolute position information). For ITML, we vary the parameter γ from 10^-4 to 10^4 and 
choose the best (10^-3). For each method, we generate 1% of the available positive constraints and 1% of 
the available negative constraints (as proposed in [Hoi et al. 2006]). For RFD, we construct a random forest 
with 1500 trees. Using five-fold cross validation, we retrieve the 20 nearest neighbors of each image under 
each metric. Accuracy is determined by counting the fraction of the retrieved images that are the same 
class as the image that retrieved them. We repeat this experiment 10 times with differing random folds and 
report the average results in Figure 4. RFD clearly outperforms the other methods tested, achieving the best 
accuracy on all but the cougar category. Also note that ITML performs roughly on par with or worse than the 
baseline on 7 classes, and DCA on 5, while RFD fails only on 1, indicating again that RFD provides a better 
global distance measure than current state of the art approaches, and is less likely to sacrifice performance 
in one region in order to gain it in another. 
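
The accuracy measure described above (the fraction of retrieved neighbors sharing the query's class) can be sketched as follows. This is a simplified illustration, not the evaluation code used for Figure 4: the function names are hypothetical, the query retrieves from all other points rather than from a cross-validation split, and any distance function (Euclidean here, or a learned one) can be passed in.

```python
def euclidean(u, v):
    """Plain Euclidean distance, used here as the baseline metric."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def precision_at_k(query_idx, points, labels, distance, k=20):
    """Fraction of the k nearest neighbors that share the query's label."""
    others = [i for i in range(len(points)) if i != query_idx]
    others.sort(key=lambda i: distance(points[query_idx], points[i]))
    top = others[:k]
    hits = sum(1 for i in top if labels[i] == labels[query_idx])
    return hits / len(top)


points = [[0.0], [1.0], [2.0], [10.0], [11.0]]
labels = ["a", "a", "b", "b", "b"]
p_first = precision_at_k(0, points, labels, euclidean, k=2)
p_fourth = precision_at_k(3, points, labels, euclidean, k=2)
```

Averaging this value over all query images in a category gives one per-category retrieval precision for a given metric.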

4 Conclusion 

In this paper, we have proposed a new angle on the metric learning problem. Our method, called random 
forest distance (RFD), incorporates both the conventional relative position of point pairs and the absolute 
position of point pairs into the learned metric, and hence implicitly adapts the metric throughout the feature 
space. Our evaluation has demonstrated the capability of RFD, which has the best overall performance in terms 
of accuracy and speed on a variety of benchmarks. 

This paper opens several immediate directions of inquiry. First, RFD further demonstrates 
the capability of classification methods underpinning metric learning. Similar feature mapping 
functions and other underlying forms for the distance function need to be investigated. Second, the utility of 
absolute pairwise position is clear from our work, which is a good indication of the need for multiple 
metrics. Open questions remain about other representations of the position as well as the use of position in other 
metric forms, even the classic Mahalanobis metric. Third, there are connections between random forests 
and nearest-neighbor methods, which may explain the good performance we have observed. We have not 
explored them in any detail in this paper and plan to do so in the future. Finally, we are also investigating the use 
of RFD on larger-scale, more diverse data sets like the new MIT SUN image classification data set. 